Analyzing the Continuing Impact of Redlining on Baltimore's Housing Market

By Aaron Szabo

Last Updated: 12/16/2019

Introduction

In the mid-1930s, the Home Owners' Loan Corporation (HOLC) began appraising building conditions in cities across the country. This information was used to generate maps that marked the level of risk for residential mortgage lenders. Each region was given a letter grade and a color. The highest grade, "A", was colored green on the maps. The lowest grade, "D", representing the highest risk, was colored red. The problem is that these rankings were heavily influenced by racial discrimination. The supporting documents that came with the maps describe the "D"-ranked areas as follows:

"The fourth grade or D areas represent those neighborhoods in which the things that are now taking place in the C neighborhoods, have already happened. They are characterized by detrimental influences in a pronounced degree, undesirable population or an infiltration of it. Low percentage of home ownership, very poor maintenance and often vandalism prevail. Unstable incomes of the people and difficult collections are usually prevalent. The areas are broader than the so-called slum districts. Some mortgage lenders may refuse to make loans in these neighborhoods and others will lend only on a conservative basis."

– The Baltimore "Redlining" Map: Ranking Neighborhoods
The term "redlining" was coined to describe banks' and other institutions' practice of not investing in areas that were given a grade of "D", i.e. colored red on the HOLC maps. Since minorities were specifically targeted by the appraisers, it was minority communities that disproportionately suffered from redlining. This form of economic racial discrimination is one of the main arguments for reparations. However, to begin the discussion of reparations, we first need to understand the current, ongoing economic effects of redlining.

For more information regarding redlining, see the National Community Reinvestment Coalition report on the continuing effects of redlining:
HOLC "redlining" maps: The persistent structure of segregation and economic inequality
For a summary, see the Washington Post article about the report:
Redlining was banned 50 years ago. It's still hurting minorities today
Digitizations of the redlining maps are available at
Mapping Inequality
And for Baltimore-specific information, see
The Baltimore "Redlining" Map: Ranking Neighborhoods

The purpose of this project is to analyze the current effects of redlining on Baltimore City's housing market. Toward this end, I analyzed four measures of the housing market: the percent of lots that are vacant, the percent of foreclosed units, the percent of units sold, and the median sales price. To determine the effects of redlining, I tested how well the redlining map predicts each of these statistics.

Python Primer

The analysis was done using Python and a few core packages: Pandas, GeoPandas, Folium, matplotlib, and StatsModels. (Each name is a link to the package's documentation.) The data manipulation was done using Pandas and then exported to GeoPandas for easier integration with Folium, the package used to make the interactive maps. matplotlib was used to make the histograms of the data, and the statistical analysis was done with StatsModels.

import pandas as pd
import geopandas as gpd
import folium
import shapely
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import math
import matplotlib.pyplot as plt

The Data

The redlining data comes from Mapping Inequality as a digitized map in GeoJSON, a standardized format for storing geographic data.
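As a quick illustration of what GeoJSON looks like, here is a minimal, hypothetical FeatureCollection with one HOLC-style feature (made up for illustration, not taken from the actual Mapping Inequality file):

```python
import json

# A minimal, hypothetical GeoJSON FeatureCollection with one HOLC-style feature
sample = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"holc_grade": "D"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-76.62, 39.29], [-76.60, 39.29],
                         [-76.60, 39.31], [-76.62, 39.31],
                         [-76.62, 39.29]]]
      }
    }
  ]
}
"""
parsed = json.loads(sample)
print(parsed["type"])                                     # FeatureCollection
print(parsed["features"][0]["properties"]["holc_grade"])  # D
```

`gpd.read_file` parses this structure into a DataFrame with one row per feature and a `geometry` column holding the polygons.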

The modern data comes from the Open Baltimore site, a collection of publicly available data about Baltimore. This data is organized by Census tracts and block groups. (For more background, see link.) The geographies of these block groups were obtained from the Census Bureau and converted into GeoJSON online using MyGeodata Cloud.

redline_data = gpd.read_file("MDBaltimore1937.geojson")
data = gpd.read_file("mygeodata/tl_2010_24510_bg10.geojson")
housing_data = pd.read_csv("2011_Housing_Market_Typology.csv")

Data Manipulation

The first step in the data manipulation process was adding a hex code color column for each of the redlining grade colors.

def redlining_colormap(holc_grade):
    if holc_grade == 'A':
        return "#00ff00"  # Green
    elif holc_grade == 'B':
        return "#0000ff"  # Blue
    elif holc_grade == 'C':
        return "#ffff00"  # Yellow
    elif holc_grade == 'D':
        return "#ff0000"  # Red
    else:
        return "#ffffff"
    
# Applies the above function to each element of redline_data['holc_grade']
redline_data['color'] = redline_data['holc_grade'].apply(redlining_colormap)  
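The chain of `elif` branches can equivalently be written as a dictionary lookup; a sketch (same mapping, with a hypothetical `HOLC_COLORS` name):

```python
# Same grade-to-color mapping as the function above, as a dictionary
HOLC_COLORS = {'A': '#00ff00', 'B': '#0000ff', 'C': '#ffff00', 'D': '#ff0000'}

def redlining_colormap_dict(holc_grade):
    # Fall back to white for ungraded areas, as in the original function
    return HOLC_COLORS.get(holc_grade, '#ffffff')

print(redlining_colormap_dict('A'))  # #00ff00
print(redlining_colormap_dict('E'))  # #ffffff
```

With pandas, the same mapping could also be applied directly as `redline_data['holc_grade'].map(HOLC_COLORS).fillna('#ffffff')`.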

In order to judge the effects of redlining, the overlap between each redlining area and each Census block group was computed and recorded as the percentage of the block group covered by each HOLC grade. The area of each overlap was calculated using shapely.

perc_green = []
perc_blue = []
perc_yellow = []
perc_red = []

# iterate through each row of the Census block group data
for index1,c_row in data.iterrows():
    c_district = c_row['geometry']  # the polygon representing the current Census block group
    c_area = c_district.area  # the area of the polygon
    is_green = False
    is_blue = False
    is_yellow = False
    is_red = False
    # iterate through each row of the redlining map data searching for overlaps
    for index2,r_row in redline_data.iterrows():
        r_geom = r_row['geometry']  # the polygon representing the current redlining region
        # check if the block group and the redlining region overlap
        if c_district.intersects(r_geom):
            # if they do overlap, get the area of the intersection
            overlap_area = c_district.intersection(r_geom).area
            # the overlap includes lines, which have an area of 0, so we will ignore them
            if (overlap_area > 0):
                if r_row['holc_grade'] == 'A':
                    if is_green:
                        perc_green[index1] += overlap_area/c_area
                    else:
                        is_green = True
                        perc_green.append(overlap_area/c_area)
                elif r_row['holc_grade'] == 'B':
                    if is_blue:
                        perc_blue[index1] += overlap_area/c_area
                    else:
                        is_blue = True
                        perc_blue.append(overlap_area/c_area)
                elif r_row['holc_grade'] == 'C':
                    if is_yellow:
                        perc_yellow[index1] += overlap_area/c_area
                    else:
                        is_yellow = True
                        perc_yellow.append(overlap_area/c_area)
                elif r_row['holc_grade'] == 'D':
                    if is_red:
                        perc_red[index1] += overlap_area/c_area
                    else:
                        is_red = True
                        perc_red.append(overlap_area/c_area)
    # add a 0 to the list of percent overlap of each color if there were no overlaps of that color
    if not is_green:
        perc_green.append(0.0)
    if not is_blue:
        perc_blue.append(0.0)
    if not is_yellow:
        perc_yellow.append(0.0)
    if not is_red:
        perc_red.append(0.0)
data['perc_green'] = perc_green
data['perc_blue'] = perc_blue
data['perc_yellow'] = perc_yellow
data['perc_red'] = perc_red
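To illustrate the intersection-area idea without shapely, here is a toy version using axis-aligned rectangles (the coordinates and `rect_intersection_area` are made up for illustration):

```python
def rect_intersection_area(a, b):
    """Area of overlap between two axis-aligned rectangles given as (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    # Rectangles that only touch along an edge have w or h of 0, i.e. zero area,
    # mirroring the overlap_area > 0 check above
    return w * h if (w > 0 and h > 0) else 0.0

block_group = (0.0, 0.0, 4.0, 4.0)   # 16 square units
holc_region = (2.0, 2.0, 6.0, 6.0)   # overlaps the upper-right quadrant

overlap = rect_intersection_area(block_group, holc_region)
coverage = overlap / 16.0  # fraction of the block group covered
print(coverage)  # 0.25
```

shapely's `intersects` / `intersection(...).area` generalize this same computation to arbitrary polygons.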

The GeoID in the Baltimore data does not come with leading zeros, so those are added here.

housing_data['blockGroup'] = housing_data['blockGroup'].apply(lambda x: str(x) if (len(str(x)) == 7) else ('0'+str(x)))
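The same padding can be written with `str.zfill` (a sketch; `pad_block_group` is a hypothetical name, and it matches the lambda above only because the IDs are 6 or 7 characters long):

```python
def pad_block_group(x, width=7):
    # str.zfill pads with leading zeros up to the given width;
    # IDs that are already 7 characters pass through unchanged
    return str(x).zfill(width)

print(pad_block_group(123456))    # 0123456
print(pad_block_group('2710012'))  # 2710012
```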

30 of the 653 Census block groups in Baltimore City do not have any associated data in the Baltimore database used. Those areas are recorded here and saved to be used on the map below.

missing = []
# iterate through the Census block group data
for index,d_row in data.iterrows():
    found = False
    # iterate through the Open Baltimore housing data
    for index2,h_row in housing_data.iterrows():
        if ('24510' + h_row['blockGroup']) == d_row['GEOID10']:
            found = True
            break            
    if not found:
        missing.append(d_row['geometry'])
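The nested `iterrows` search above is O(n·m); building a set of the known GeoIDs first reduces each membership check to O(1). A sketch with made-up IDs:

```python
# Hypothetical GeoIDs: two Census block groups, only one present in the housing data
census_geoids = ['245100010011', '245100010012']
housing_block_groups = ['0010011']  # 7-digit block-group IDs from Open Baltimore

# Prefix with the state+county FIPS code ('24510') to match the Census GEOID format
known = {'24510' + bg for bg in housing_block_groups}
missing_ids = [g for g in census_geoids if g not in known]
print(missing_ids)  # ['245100010012']
```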

Here the data is converted into GeoPandas DataFrames for ease of integration with Folium.

# Create a copy of the combined data (com_data, built in "The Data: Part 2" below) to preserve its values
temp = com_data.copy()
# Convert to a GeoPandas DataFrame
census_geo = gpd.GeoDataFrame(temp, geometry=temp['geometry'])
# The CRS tells folium what projection the geometry's units are in
census_geo.crs = {'init': 'epsg:4269'}

# same as above
missing_df = pd.DataFrame(columns=['geometry'], data=missing)
missing_gdf = gpd.GeoDataFrame(missing_df, geometry=missing_df['geometry'])
missing_gdf.crs = {'init': 'epsg:4269'}

Redlining Map

Below is a map of Baltimore City overlaid with the HOLC redlining map and the Census block groups. The bright blue areas are the block groups missing in the Baltimore data. Each layer can be turned on and off using the layer control menu in the top right of the map.

# make the map and set the starting location
map_c = folium.Map(location=[39.29, -76.61], zoom_start=11)
# Add the redlining map layer
folium.GeoJson(redline_data, name='Redlining Map', style_function=lambda feature: {
    'fillColor': feature['properties']['color'],
    'color': feature['properties']['color'],
    'weight': 1,
    'fillOpacity': 0.5,
}).add_to(map_c)
# Add the matched Census block groups
folium.GeoJson(census_geo, name='Census Districts', style_function=lambda feature: {
    'fillColor': "#333333",
    'color': "#000000",
    'weight': 0.5,
    'fillOpacity': 0.1
}).add_to(map_c)
# Add the missing Census block groups
folium.GeoJson(missing_gdf, name='Missing Census Districts', style_function=lambda feature: {
    'fillColor': "#00ffff",
    'color': "#000000",
    'weight': 0.5,
    'fillOpacity': 0.5
}).add_to(map_c)
# Add the layer control menu
folium.LayerControl().add_to(map_c)
map_c  # show the map

The Data: Part 2

Here the map data generated above, recording the percent coverage of each HOLC grade, is merged with the Baltimore housing data. This is done using the GeoID, a 12-digit identifier used by the Census Bureau for block groups. (For more information, see this link)
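For illustration, a block-group GEOID splits into its Census FIPS components like this (a sketch; `split_geoid` is a hypothetical helper):

```python
def split_geoid(geoid):
    """Split a 12-digit block-group GEOID into its Census FIPS components."""
    return {
        'state': geoid[0:2],        # 24 = Maryland
        'county': geoid[2:5],       # 510 = Baltimore City
        'tract': geoid[5:11],       # 6-digit Census tract
        'block_group': geoid[11:],  # 1-digit block group
    }

parts = split_geoid('245100010011')
print(parts['state'], parts['county'])  # 24 510
```

This is why prefixing the 7-digit Open Baltimore `blockGroup` value with `'24510'` reproduces the full GEOID.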

The categories are as follows:

  • Ratio Vacant: The number of vacant lots (as of July 2010) divided by the number of residential lots plus the number of vacant lots
  • Ratio Foreclosed: The number of foreclosure filings (in 2009-2010) as a percent of privately owned residential lots
  • Ratio Sales: The number of residential sales (in 2009-2010) multiplied by the area (in square miles) divided by the number of units per square mile
  • Median Sale Price: The median residential sales price (in 2009-2010)
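For example, the vacancy ratio for a hypothetical block group with 12 vacant lots and 188 residential lots would be computed as:

```python
def ratio_vacant(vacant_lots, residential_lots):
    # vacant lots as a percentage of all lots (residential + vacant)
    return 100.0 * vacant_lots / (residential_lots + vacant_lots)

print(ratio_vacant(12, 188))  # 6.0
```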

An interesting note: one Census block group had a foreclosure percentage of 450. This caused major problems with the statistical analysis and seemed erroneous, so I capped that value at 100.

Due to the skewness of the data, the log of each category was also recorded. For more information regarding log transformation in linear models, see this link
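A sketch of the zero-guarded log transform used in the merge loop below (`safe_log` is a hypothetical name; the loop maps zeros to 0 rather than negative infinity):

```python
import math

def safe_log(x):
    # Zero values would give log(0) = -inf, so map them to 0 instead;
    # log1p(x) = log(1 + x) is a common alternative that handles 0 smoothly
    return math.log(x) if x > 0 else 0.0

print(safe_log(math.e))  # 1.0
print(safe_log(0))       # 0.0
```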

# create a new DataFrame which will hold the combined data
com_data = pd.DataFrame(columns=['GEOID10', 'geometry', 'area', 'perc_green', 'perc_blue', 'perc_yellow', 'perc_red', 'ratio_vacant', 'ratio_foreclosed', 'ratio_sales', 'median_sale_price', 'log_ratio_vacant', 'log_ratio_foreclosed', 'log_ratio_sales', 'log_median_sale_price'])
for index,d_row in data.iterrows():
    # the total area of the district in square miles (ALAND is in square meters)
    area = d_row['ALAND10']/2590000.0
    # data that can be direct copied from the Census data
    new_row = {'GEOID10': d_row['GEOID10'], 
              'geometry': d_row['geometry'],
              'area': area,
              'perc_green': d_row['perc_green'],
              'perc_blue': d_row['perc_blue'],
              'perc_yellow': d_row['perc_yellow'],
              'perc_red': d_row['perc_red']}
    # finding the corresponding entry in the Open Baltimore data
    for index2,h_row in housing_data.iterrows():
        if ('24510' + h_row['blockGroup']) == d_row['GEOID10']:
            new_row['ratio_vacant'] = h_row['vacantLots']
            if new_row['ratio_vacant'] == 0:
                new_row['log_ratio_vacant'] = 0
            else:
                new_row['log_ratio_vacant'] = math.log(new_row['ratio_vacant'])
            if h_row['foreclosureFilings'] > 100:  # dealing with the foreclosure percentage of 450
                new_row['ratio_foreclosed'] = 100
            else:
                new_row['ratio_foreclosed'] = h_row['foreclosureFilings']
            if new_row['ratio_foreclosed'] == 0:
                new_row['log_ratio_foreclosed'] = 0
            else:
                new_row['log_ratio_foreclosed'] = math.log(new_row['ratio_foreclosed'])
            if h_row['unitsPerSquareMile'] == 0:
                new_row['ratio_sales'] = 0
            else:
                # Adjusting the sales to control for number of units
                new_row['ratio_sales'] = h_row['sales20092010']/h_row['unitsPerSquareMile']*area
            if new_row['ratio_sales'] == 0:
                new_row['log_ratio_sales'] = 0
            else:
                new_row['log_ratio_sales'] = math.log(new_row['ratio_sales'])
            new_row['median_sale_price'] = h_row['medianSalesPrice20092010']
            if new_row['median_sale_price'] == 0:
                new_row['log_median_sale_price'] = 0
            else:
                new_row['log_median_sale_price'] = math.log(new_row['median_sale_price'])
            break
    com_data = com_data.append(pd.Series(new_row), ignore_index=True)
com_data
com_data = com_data.dropna()

Raw Data Analysis

Below is another map with layers showing each of the raw categories listed above. The legend for each layer is shown at the top of the map.

map_d = folium.Map(location=[39.29, -76.61], zoom_start=11)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Ratio of Vacancies',
    data=census_geo,
    columns=['GEOID10', 'ratio_vacant'],
    key_on='feature.properties.GEOID10',
    fill_color='Blues',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Ratio of Vacancies',
    show=False).add_to(map_d)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Ratio of Foreclosures',
    data=census_geo,
    columns=['GEOID10', 'ratio_foreclosed'],
    key_on='feature.properties.GEOID10',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.7,
    legend_name='Ratio of Foreclosures',
    show=False).add_to(map_d)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Ratio of Sales',
    data=census_geo,
    columns=['GEOID10', 'ratio_sales'],
    key_on='feature.properties.GEOID10',
    fill_color='Purples',
    fill_opacity=0.7,
    line_opacity=0.7,
    legend_name='Ratio of Sales',
    show=False).add_to(map_d)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Median Sales Price',
    data=census_geo,
    columns=['GEOID10', 'median_sale_price'],
    key_on='feature.properties.GEOID10',
    fill_color='BuGn',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Median Sales Price',
    show=False).add_to(map_d)
folium.GeoJson(redline_data, name='Redlining Map', style_function=lambda feature: {
    'fillColor': feature['properties']['color'],
    'color': feature['properties']['color'],
    'weight': 0.7,
    'fillOpacity': 0.3,
}).add_to(map_d)
folium.LayerControl().add_to(map_d)
map_d

Below is a statistical analysis of each raw category.

Ratio Vacant

plt.hist(com_data['ratio_vacant'])
plt.title("Histogram of Ratio Vacant")
plt.xlabel("Percentage")
X = sm.add_constant(com_data[['perc_green', 'perc_blue', 'perc_yellow', 'perc_red']])
smmodel_v = sm.OLS(com_data['ratio_vacant'], X)
smfit_v = smmodel_v.fit()
print(smfit_v.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           ratio_vacant   R-squared:                       0.095
Model:                            OLS   Adj. R-squared:                  0.089
Method:                 Least Squares   F-statistic:                     16.23
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           1.19e-12
Time:                        06:01:31   Log-Likelihood:                -2392.4
No. Observations:                 623   AIC:                             4795.
Df Residuals:                     618   BIC:                             4817.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           8.5384      1.026      8.319      0.000       6.523      10.554
perc_green     -6.9961      2.627     -2.663      0.008     -12.156      -1.837
perc_blue      -2.7583      1.493     -1.847      0.065      -5.691       0.174
perc_yellow    -1.5910      1.447     -1.099      0.272      -4.433       1.251
perc_red        7.3649      1.542      4.776      0.000       4.337      10.393
==============================================================================
Omnibus:                      470.478   Durbin-Watson:                   1.909
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             8264.386
Skew:                           3.235   Prob(JB):                         0.00
Kurtosis:                      19.628   Cond. No.                         7.06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The important result from this test is the probability of the F-statistic, which measures the certainty of the model's result. A probability greater than 0.05 means that we cannot reject the null hypothesis that the dependent variable, the ratio of vacant lots in this case, is independent of the explanatory variables, the percentages of the block group's area covered by each HOLC grade.
In this case, the probability of the F-statistic is 1.19e-12. This number is effectively 0, meaning that there is a strong correlation between redlining map coverage and the ratio of vacant lots.
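The decision rule used throughout this analysis boils down to a comparison against the conventional 0.05 threshold; a minimal sketch (`rejects_null` is a hypothetical helper; with StatsModels the p-value is available as `smfit_v.f_pvalue`):

```python
def rejects_null(f_pvalue, alpha=0.05):
    # Reject the null hypothesis of no relationship when p < alpha
    return f_pvalue < alpha

print(rejects_null(1.19e-12))  # True
print(rejects_null(0.0996))    # False
```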

Ratio Foreclosed

plt.hist(com_data['ratio_foreclosed'])
plt.title("Histogram of Ratio Foreclosed")
plt.xlabel("Percentage")
smmodel_f = sm.OLS(com_data['ratio_foreclosed'], X)
smfit_f = smmodel_f.fit()
print(smfit_f.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:       ratio_foreclosed   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.006
Method:                 Least Squares   F-statistic:                     1.957
Date:                Mon, 16 Dec 2019   Prob (F-statistic):             0.0996
Time:                        06:01:35   Log-Likelihood:                -1903.6
No. Observations:                 623   AIC:                             3817.
Df Residuals:                     618   BIC:                             3839.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           5.0291      0.468     10.739      0.000       4.109       5.949
perc_green     -1.8366      1.199     -1.532      0.126      -4.191       0.518
perc_blue       1.1273      0.681      1.654      0.099      -0.211       2.465
perc_yellow     0.7798      0.660      1.181      0.238      -0.517       2.077
perc_red        0.4079      0.704      0.580      0.562      -0.974       1.790
==============================================================================
Omnibus:                     1113.143   Durbin-Watson:                   1.902
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           993179.722
Skew:                          11.405   Prob(JB):                         0.00
Kurtosis:                     197.268   Cond. No.                         7.06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Here, the probability of the F-statistic is not less than 0.05, so we cannot reject the null hypothesis of no correlation.

Ratio Sales

plt.hist(com_data['ratio_sales'])
plt.title("Histogram of Ratio Sales")
plt.xlabel("Percentage")
smmodel_s = sm.OLS(com_data['ratio_sales'], X)
smfit_s = smmodel_s.fit()
print(smfit_s.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            ratio_sales   R-squared:                       0.022
Model:                            OLS   Adj. R-squared:                  0.016
Method:                 Least Squares   F-statistic:                     3.534
Date:                Mon, 16 Dec 2019   Prob (F-statistic):            0.00730
Time:                        06:01:39   Log-Likelihood:                 1565.8
No. Observations:                 623   AIC:                            -3122.
Df Residuals:                     618   BIC:                            -3099.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           0.0081      0.002      4.508      0.000       0.005       0.012
perc_green     -0.0075      0.005     -1.630      0.104      -0.016       0.002
perc_blue      -0.0080      0.003     -3.088      0.002      -0.013      -0.003
perc_yellow    -0.0082      0.003     -3.246      0.001      -0.013      -0.003
perc_red       -0.0081      0.003     -3.010      0.003      -0.013      -0.003
==============================================================================
Omnibus:                     1527.289   Durbin-Watson:                   1.084
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          7754833.370
Skew:                          22.764   Prob(JB):                         0.00
Kurtosis:                     547.673   Cond. No.                         7.06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Median Sales Price

plt.hist(com_data['median_sale_price'])
plt.title("Histogram of Median Sales Prices")
plt.xlabel("Dollars")
smmodel_p = sm.OLS(com_data['median_sale_price'], X)
smfit_p = smmodel_p.fit()
print(smfit_p.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:      median_sale_price   R-squared:                       0.152
Model:                            OLS   Adj. R-squared:                  0.147
Method:                 Least Squares   F-statistic:                     27.78
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           3.12e-21
Time:                        06:01:43   Log-Likelihood:                -8006.8
No. Observations:                 623   AIC:                         1.602e+04
Df Residuals:                     618   BIC:                         1.605e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const        1.107e+05   8415.430     13.154      0.000    9.42e+04    1.27e+05
perc_green   1.501e+05   2.15e+04      6.967      0.000    1.08e+05    1.92e+05
perc_blue     211.1686   1.22e+04      0.017      0.986   -2.38e+04    2.43e+04
perc_yellow -5.917e+04   1.19e+04     -4.986      0.000   -8.25e+04   -3.59e+04
perc_red    -1.157e+04   1.26e+04     -0.915      0.360   -3.64e+04    1.33e+04
==============================================================================
Omnibus:                      241.677   Durbin-Watson:                   1.656
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              944.313
Skew:                           1.787   Prob(JB):                    8.81e-206
Kurtosis:                       7.858   Cond. No.                         7.06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Both the ratio of sales and the median sales price had probabilities less than 0.05, so the model suggests that both categories are strongly correlated with the redlining map.

Of further note, in the graphs of the data above, all the categories are heavily skewed to the right. In an effort to combat this and improve the accuracy of the models, we will now look at the log of each category.

Normalized Data Analysis

Below is the same map as above but with each category transformed using log. This has the added benefit of making the variations more distinguishable on the map.

map_dl = folium.Map(location=[39.29, -76.61], zoom_start=11)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Log of Ratio of Vacancies',
    data=census_geo,
    columns=['GEOID10', 'log_ratio_vacant'],
    key_on='feature.properties.GEOID10',
    fill_color='Blues',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Log of Ratio of Vacancies',
    show=False).add_to(map_dl)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Log of Ratio of Foreclosures',
    data=census_geo,
    columns=['GEOID10', 'log_ratio_foreclosed'],
    key_on='feature.properties.GEOID10',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.7,
    legend_name='Log of Ratio of Foreclosures',
    show=False).add_to(map_dl)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Log of Ratio of Sales',
    data=census_geo,
    columns=['GEOID10', 'log_ratio_sales'],
    key_on='feature.properties.GEOID10',
    fill_color='Purples',
    fill_opacity=0.7,
    line_opacity=0.7,
    legend_name='Log of Ratio of Sales',
    show=False).add_to(map_dl)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Log of Median Sales Price',
    data=census_geo,
    columns=['GEOID10', 'log_median_sale_price'],
    key_on='feature.properties.GEOID10',
    fill_color='BuGn',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Log of Median Sales Price',
    show=False).add_to(map_dl)
folium.GeoJson(redline_data, name='Redlining Map', style_function=lambda feature: {
    'fillColor': feature['properties']['color'],
    'color': feature['properties']['color'],
    'weight': 0.7,
    'fillOpacity': 0.3,
}).add_to(map_dl)
folium.LayerControl().add_to(map_dl)
map_dl

Log of Ratio Vacant

plt.hist(com_data['log_ratio_vacant'])
plt.title("Histogram of the Log of Ratio Vacant")
plt.xlabel("Log of Percentage")
smmodel_vl = sm.OLS(com_data['log_ratio_vacant'], X)
smfit_vl = smmodel_vl.fit()
print(smfit_vl.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:       log_ratio_vacant   R-squared:                       0.065
Model:                            OLS   Adj. R-squared:                  0.059
Method:                 Least Squares   F-statistic:                     10.73
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           2.07e-08
Time:                        06:01:55   Log-Likelihood:                -933.32
No. Observations:                 623   AIC:                             1877.
Df Residuals:                     618   BIC:                             1899.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           1.4424      0.099     14.620      0.000       1.249       1.636
perc_green     -0.8235      0.253     -3.261      0.001      -1.319      -0.328
perc_blue      -0.0570      0.144     -0.397      0.691      -0.339       0.225
perc_yellow     0.0886      0.139      0.637      0.524      -0.185       0.362
perc_red        0.5934      0.148      4.003      0.000       0.302       0.884
==============================================================================
Omnibus:                       32.017   Durbin-Watson:                   1.835
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               13.034
Skew:                          -0.006   Prob(JB):                      0.00148
Kurtosis:                       2.292   Cond. No.                         7.06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

While the probability of the F-statistic increased, we can see in the histogram that the data being modeled is far more normal (in the statistical sense), which makes the model more trustworthy. Of note, this category is bimodal, with modes around 0 and 1.75.

Log of Ratio Foreclosed

plt.hist(com_data['log_ratio_foreclosed'])
plt.title("Histogram of the Log of Ratio Foreclosed")
plt.xlabel("Log of Percentage")
smmodel_fl = sm.OLS(com_data['log_ratio_foreclosed'], X)
smfit_fl = smmodel_fl.fit()
print(smfit_fl.summary())
                             OLS Regression Results                             
================================================================================
Dep. Variable:     log_ratio_foreclosed   R-squared:                       0.069
Model:                              OLS   Adj. R-squared:                  0.063
Method:                   Least Squares   F-statistic:                     11.38
Date:                  Mon, 16 Dec 2019   Prob (F-statistic):           6.40e-09
Time:                          06:01:58   Log-Likelihood:                -571.03
No. Observations:                   623   AIC:                             1152.
Df Residuals:                       618   BIC:                             1174.
Df Model:                             4                                         
Covariance Type:              nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           1.3237      0.055     23.999      0.000       1.215       1.432
perc_green     -0.2836      0.141     -2.009      0.045      -0.561      -0.006
perc_blue       0.3854      0.080      4.802      0.000       0.228       0.543
perc_yellow     0.3312      0.078      4.258      0.000       0.178       0.484
perc_red        0.1142      0.083      1.377      0.169      -0.049       0.277
==============================================================================
Omnibus:                       36.192   Durbin-Watson:                   1.726
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              105.370
Skew:                          -0.205   Prob(JB):                     1.32e-23
Kurtosis:                       4.973   Cond. No.                         7.06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Here we can see a big improvement with the transformed data. Before, the model was not statistically significant; after the transformation, the probability of the F-statistic is 6.40e-09, which is highly significant.
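As a sanity check, the reported p-value can be reproduced from the F-statistic and its degrees of freedom. A small sketch (scipy is not used elsewhere in this analysis, so this is an illustrative aside):

```python
from scipy import stats

# Values from the summary above: F = 11.38, df_model = 4, df_resid = 618
# The p-value is the survival function of the F distribution at F
p_value = stats.f.sf(11.38, 4, 618)
print(f"{p_value:.2e}")
```

This matches the Prob (F-statistic) line in the summary table to rounding.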

Log of Ratio Sales

plt.hist(com_data['log_ratio_sales'])
plt.title("Histogram of the Log of Ratio Sales")
plt.xlabel("Log of Percentage")
smmodel_sl = sm.OLS(com_data['log_ratio_sales'], X)
smfit_sl = smmodel_sl.fit()
print(smfit_sl.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        log_ratio_sales   R-squared:                       0.211
Model:                            OLS   Adj. R-squared:                  0.206
Method:                 Least Squares   F-statistic:                     41.38
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           9.48e-31
Time:                        06:02:01   Log-Likelihood:                -1298.3
No. Observations:                 623   AIC:                             2607.
Df Residuals:                     618   BIC:                             2629.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          -6.0334      0.177    -34.037      0.000      -6.381      -5.685
perc_green     -1.8539      0.454     -4.086      0.000      -2.745      -0.963
perc_blue      -1.8875      0.258     -7.318      0.000      -2.394      -1.381
perc_yellow    -3.0198      0.250    -12.081      0.000      -3.511      -2.529
perc_red       -2.6485      0.266     -9.945      0.000      -3.172      -2.126
==============================================================================
Omnibus:                      244.622   Durbin-Watson:                   2.081
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1171.674
Skew:                           1.722   Prob(JB):                    3.75e-255
Kurtosis:                       8.768   Cond. No.                         7.06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Here we see another improvement: the model goes from significant to overwhelmingly significant, with a probability of the F-statistic of 9.48e-31.

Log of Median Sales Price

plt.hist(com_data['log_median_sale_price'])
plt.title("Histogram of the Log of the Median Sales Price")
plt.xlabel("Log of Dollars")
smmodel_pl = sm.OLS(com_data['log_median_sale_price'], X)
smfit_pl = smmodel_pl.fit()
print(smfit_pl.summary())
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     log_median_sale_price   R-squared:                       0.053
Model:                               OLS   Adj. R-squared:                  0.047
Method:                    Least Squares   F-statistic:                     8.589
Date:                   Mon, 16 Dec 2019   Prob (F-statistic):           9.48e-07
Time:                           06:02:04   Log-Likelihood:                -1372.9
No. Observations:                    623   AIC:                             2756.
Df Residuals:                        618   BIC:                             2778.
Df Model:                              4                                         
Covariance Type:               nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          10.4095      0.200     52.101      0.000      10.017      10.802
perc_green      1.9083      0.511      3.731      0.000       0.904       2.913
perc_blue       1.0123      0.291      3.482      0.001       0.441       1.583
perc_yellow     0.0931      0.282      0.331      0.741      -0.460       0.646
perc_red       -0.1684      0.300     -0.561      0.575      -0.758       0.421
==============================================================================
Omnibus:                      501.837   Durbin-Watson:                   1.942
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7321.817
Skew:                          -3.667   Prob(JB):                         0.00
Kurtosis:                      18.108   Cond. No.                         7.06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This category behaved similarly to ratio vacant under the transformation: while the data is more normal, the model produced a lower probability. However, looking at the histogram, there is a cluster of values at 0. I believe this comes from block groups with no recorded median sales price (either because there were no residential sales or because the data was not recorded). Thus, I reran the analysis below with those zeros removed.

Price Data Reanalyzed

# Count block groups with a recorded median sale price of 0
median_zero_count = (com_data['median_sale_price'] == 0).sum()
print(median_zero_count)
22

There are 22 block groups with a listed median sale price of 0.

To preserve the original data, I make a copy of the DataFrame for this analysis. Then I replace each 0 in median sales price with a NaN and remove all rows containing a NaN.

# Copy the DataFrame so the original data is preserved
data_price = com_data.copy()
# Replace each 0 median sale price with NaN, then drop those rows
data_price['median_sale_price'] = data_price['median_sale_price'].apply(lambda x: x if x > 0 else np.nan)
data_price = data_price.dropna()
# Rebuild the GeoDataFrame for mapping
price_geo = gpd.GeoDataFrame(data_price, geometry=data_price['geometry'])
price_geo.crs = {'init': 'epsg:4269'}

Once again, here is the map of Baltimore City with the redlining map overlay, now showing the median sales price and the transformed median sales price.

map_dp = folium.Map(location=[39.29, -76.61], zoom_start=11)
folium.Choropleth(
    geo_data=price_geo[['GEOID10', 'geometry']],
    name='Median Sales Price',
    data=price_geo,
    columns=['GEOID10', 'median_sale_price'],
    key_on='feature.properties.GEOID10',
    fill_color='BuGn',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Median Sales Price',
    show=False).add_to(map_dp)
folium.Choropleth(
    geo_data=price_geo[['GEOID10', 'geometry']],
    name='Log of Median Sales Price',
    data=price_geo,
    columns=['GEOID10', 'log_median_sale_price'],
    key_on='feature.properties.GEOID10',
    fill_color='BuGn',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Log of Median Sales Price',
    show=False).add_to(map_dp)
folium.GeoJson(redline_data, name='Redlining Map', style_function=lambda feature: {
    'fillColor': feature['properties']['color'],
    'color': feature['properties']['color'],
    'weight': 0.7,
    'fillOpacity': 0.3,
}).add_to(map_dp)
folium.LayerControl().add_to(map_dp)
map_dp

Cleaned Median Sales Price

plt.hist(data_price['median_sale_price'])
plt.title("Histogram of the Cleaned Median Sales Price")
plt.xlabel("Dollars")
Xp = sm.add_constant(data_price[['perc_green', 'perc_blue', 'perc_yellow', 'perc_red']])
smmodel_pc = sm.OLS(data_price['median_sale_price'], Xp)
smfit_pc = smmodel_pc.fit()
print(smfit_pc.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:      median_sale_price   R-squared:                       0.164
Model:                            OLS   Adj. R-squared:                  0.159
Method:                 Least Squares   F-statistic:                     29.29
Date:                Mon, 16 Dec 2019   Prob (F-statistic):           2.95e-22
Time:                        06:02:17   Log-Likelihood:                -7719.5
No. Observations:                 601   AIC:                         1.545e+04
Df Residuals:                     596   BIC:                         1.547e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const        1.222e+05   8789.626     13.906      0.000    1.05e+05    1.39e+05
perc_green   1.373e+05   2.16e+04      6.357      0.000    9.49e+04     1.8e+05
perc_blue    -1.15e+04   1.25e+04     -0.917      0.359   -3.61e+04    1.31e+04
perc_yellow -7.149e+04   1.21e+04     -5.889      0.000   -9.53e+04   -4.76e+04
perc_red    -1.574e+04   1.32e+04     -1.197      0.232   -4.16e+04    1.01e+04
==============================================================================
Omnibus:                      240.040   Durbin-Watson:                   1.665
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              974.820
Skew:                           1.825   Prob(JB):                    2.09e-212
Kurtosis:                       8.060   Cond. No.                         7.17
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Log of the Cleaned Median Sales Price

plt.hist(data_price['log_median_sale_price'])
plt.title("Histogram of the Log of the Cleaned Median Sales Price")
plt.xlabel("Log of Dollars")
smmodel_pcl = sm.OLS(data_price['log_median_sale_price'], Xp)
smfit_pcl = smmodel_pcl.fit()
print(smfit_pcl.summary())
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     log_median_sale_price   R-squared:                       0.226
Model:                               OLS   Adj. R-squared:                  0.221
Method:                    Least Squares   F-statistic:                     43.46
Date:                   Mon, 16 Dec 2019   Prob (F-statistic):           5.10e-32
Time:                           06:02:20   Log-Likelihood:                -738.08
No. Observations:                    601   AIC:                             1486.
Df Residuals:                        596   BIC:                             1508.
Df Model:                              4                                         
Covariance Type:               nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          11.4997      0.079    145.093      0.000      11.344      11.655
perc_green      0.6971      0.195      3.580      0.000       0.315       1.080
perc_blue      -0.0880      0.113     -0.779      0.436      -0.310       0.134
perc_yellow    -1.0545      0.109     -9.633      0.000      -1.269      -0.840
perc_red       -0.4922      0.119     -4.150      0.000      -0.725      -0.259
==============================================================================
Omnibus:                        7.704   Durbin-Watson:                   1.596
Prob(Omnibus):                  0.021   Jarque-Bera (JB):                6.665
Skew:                           0.188   Prob(JB):                       0.0357
Kurtosis:                       2.647   Cond. No.                         7.17
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The removal of the zeros greatly improved the fit of the model, resulting in a far more significant result.

Conclusions

While it was exciting to get highly significant models for each category, the results are evidence of a darker story. The HOLC was terminated in 1953, yet its legacy lives on. Despite the nearly 70 years that have passed since then, and the more than 80 years since the making of the HOLC maps, the changes have been small enough that the grading done at the time remains an incredibly good predictor of current housing conditions in those regions.

However, all hope is not lost. Working from the coefficients of each model, the greatest negative indicator is not the grade "D" regions but the yellow, grade "C" regions. Upon further analysis of the maps, there are a number of regions that were red on the redlining map but are not in the bottom quarter of areas on any of the statistics. A further study would be needed of the processes that changed these regions. That is not to say there is no work left to be done: the red and yellow regions still tend to be worse off than the green and blue regions. This analysis also does not cover the movement of racial groups over the intervening 85-odd years; I would not count it as an improvement if better housing conditions came at the cost of forcibly relocating the original inhabitants.
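The coefficient comparison behind this conclusion can be checked directly. A minimal sketch, using the HOLC-grade coefficients copied from two of the log-model summaries above (hard-coded here rather than pulled from the fitted model objects):

```python
import pandas as pd

# HOLC-grade coefficients copied from the OLS summaries above
coefs = pd.DataFrame(
    {
        "log_ratio_sales": [-1.8539, -1.8875, -3.0198, -2.6485],
        "log_median_sale_price_cleaned": [0.6971, -0.0880, -1.0545, -0.4922],
    },
    index=["perc_green", "perc_blue", "perc_yellow", "perc_red"],
)

# The grade with the most negative coefficient in each model
worst = coefs.idxmin()
print(worst)
```

In both models the most negative coefficient belongs to perc_yellow, the grade "C" regions, matching the observation above.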